{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bcddce7b",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# download presidio\n",
    "!pip install presidio_analyzer presidio_anonymizer\n",
    "\n",
    "!python -m spacy download en_core_web_lg"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3345f1c4",
   "metadata": {},
   "source": [
    "###### Path to notebook: [https://www.github.com/microsoft/presidio/blob/main/docs/samples/python/Anonymizing%20known%20values.ipynb](https://www.github.com/microsoft/presidio/blob/main/docs/samples/python/Anonymizing%20known%20values.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a71c2409",
   "metadata": {},
   "source": [
    "# Anonymizing known values\n",
    "\n",
    "In addition to statistical and pattern based approaches, Presidio also supports the identification and anonymization of known values, using the deny-list mechanism. In this example we'll cover two cases:\n",
    "1. The known values are known a-priori (e.g., we have a list of names)\n",
    "2. The known values are only known in the context of a request (e.g., we have the name of a person as the filename)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "33008ea7",
   "metadata": {},
   "source": [
    "## Example 1: values are known a-priori\n",
    "\n",
    "Assume you have a list of potential PII values, you can create a recognizer which would detect them every time they appear in the text. For this case, we can create a deny-list based recognizer, and add it to presidio's `RecognizerRegistry`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "3d1e9cc1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from presidio_analyzer import AnalyzerEngine, RecognizerRegistry, PatternRecognizer\n",
    "from presidio_anonymizer import AnonymizerEngine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "dc54fc31",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get known values as a deny-list\n",
    "known_names_list = [\"George\", \"Abraham\", \"Theodore\", \"Bill\", \"Barack\", \"Donald\", \"Joe\"]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "b926f0a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a PatternRecognizer for the deny list\n",
    "deny_list_recognizer = PatternRecognizer(supported_entity=\"PRESIDENT_FIRST_NAME\", deny_list=known_names_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "f5448009",
   "metadata": {},
   "outputs": [],
   "source": [
    "registry = RecognizerRegistry()\n",
    "registry.add_recognizer(deny_list_recognizer)\n",
    "\n",
    "analyzer = AnalyzerEngine(registry=registry)\n",
    "\n",
    "anonymizer = AnonymizerEngine()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "ec9e24e3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Identified entities:\n",
      "[type: PRESIDENT_FIRST_NAME, start: 0, end: 6, score: 1.0]\n",
      "\n",
      "Anonymized text:\n",
      "<PRESIDENT_FIRST_NAME> Washington was the first US president\n"
     ]
    }
   ],
   "source": [
    "text=\"George Washington was the first US president\"\n",
    "\n",
    "results = analyzer.analyze(text=text, language=\"en\")\n",
    "\n",
    "print(\"Identified entities:\")\n",
    "print(results)\n",
    "print(\"\")\n",
    "anonymized = anonymizer.anonymize(text=text, analyzer_results=results)\n",
    "print(f\"Anonymized text:\\n{anonymized.text}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "510f532b",
   "metadata": {},
   "source": [
    "## Example 2: values are only known in the context of the request\n",
    "\n",
    "In some cases, we know the potential PII values only in the context of a specific text. Examples could be:\n",
    "1. Detect PII entities in free text columns in tabular databases, where other columns have entity values we can leverage\n",
    "2. Detect PII in a file having the filename or other metadata holding potential PII values (e.g. Smith.csv)\n",
    "3. Anonymize medical images which contain metadata\n",
    "4. Anonymize financial forms when the actual PII data is known\n",
    "\n",
    "In this case we can use a functionality called ad-hoc recognizers. Here's a simple example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "d57a02d6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'name': 'Martin Smith',\n",
       "  'special_value': '145A',\n",
       "  'free_text': 'Martin Smith, id 145A, likes playing basketball'},\n",
       " {'name': 'Deb Schmidt',\n",
       "  'special_value': '256B',\n",
       "  'free_text': 'Deb Schmidt, id 256B likes playing soccer'},\n",
       " {'name': 'R2D2',\n",
       "  'special_value': 'X1T2',\n",
       "  'free_text': \"X1T2 is R2D2's special value\"}]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "person1 = {\"name\": \"Martin Smith\", \n",
    "           \"special_value\":\"145A\", \n",
    "           \"free_text\": \"Martin Smith, id 145A, likes playing basketball\"}\n",
    "person2 = {\"name\":\"Deb Schmidt\", \n",
    "           \"special_value\":\"256B\", \n",
    "           \"free_text\": \"Deb Schmidt, id 256B likes playing soccer\"}\n",
    "person3 = {\"name\":\"R2D2\", \n",
    "           \"special_value\":\"X1T2\", \n",
    "           \"free_text\": \"X1T2 is R2D2's special value\"}\n",
    "\n",
    "dataset = [person1, person2, person3]\n",
    "dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cec6cb31",
   "metadata": {},
   "source": [
    "We're interested in anonymizing the free text using the values contained in `name` and `special_value`. Since these values are only available in the context of one record, we use the ad-hoc recognizer capability in Presidio, instead of a generic deny-list `PatternRecognizer` added to Presidio's `RecognizerRegistry`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "b342fc25",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<name>, id <special_value>, likes playing basketball\n",
      "<name>, id <special_value> likes playing soccer\n",
      "<special_value> is <name>'s special value\n"
     ]
    }
   ],
   "source": [
    "# Go over dataset\n",
    "for person in dataset:\n",
    "    \n",
    "    # Get the different known values\n",
    "    name = person['name']\n",
    "    special_val = person['special_value']\n",
    "    \n",
    "    # Get the free text to anonymize\n",
    "    free_text = person['free_text']\n",
    "    \n",
    "    # Create ad-hoc recognizers\n",
    "    ad_hoc_name_recognizer = PatternRecognizer(supported_entity=\"name\", deny_list = [name])\n",
    "    ad_hoc_id_recognizer = PatternRecognizer(supported_entity=\"special_value\", deny_list = [special_val])\n",
    "    \n",
    "    # Run the analyze method with ad_hoc_recognizers:\n",
    "    analyzer_results = analyzer.analyze(text=free_text, \n",
    "                                        language=\"en\", \n",
    "                                        ad_hoc_recognizers=[ad_hoc_name_recognizer, ad_hoc_id_recognizer])\n",
    "    \n",
    "    # Anonymize results\n",
    "    anonymized = anonymizer.anonymize(text=free_text, analyzer_results=analyzer_results)\n",
    "    print(anonymized.text)\n",
    "    \n",
    "    # Store output in original dataset\n",
    "    person[\"anonymized_free_text\"] = anonymized.text\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "bd347adf",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'name': 'Martin Smith',\n",
       "  'special_value': '145A',\n",
       "  'free_text': 'Martin Smith, id 145A, likes playing basketball',\n",
       "  'anonymized_free_text': '<name>, id <special_value>, likes playing basketball'},\n",
       " {'name': 'Deb Schmidt',\n",
       "  'special_value': '256B',\n",
       "  'free_text': 'Deb Schmidt, id 256B likes playing soccer',\n",
       "  'anonymized_free_text': '<name>, id <special_value> likes playing soccer'},\n",
       " {'name': 'R2D2',\n",
       "  'special_value': 'X1T2',\n",
       "  'free_text': \"X1T2 is R2D2's special value\",\n",
       "  'anonymized_free_text': \"<special_value> is <name>'s special value\"}]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Dataset now contains the anonymiezd version as well\n",
    "dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39534a7c",
   "metadata": {},
   "source": [
    "Note that in these examples we're only using the custom recognizers we created. We can also add our custom recognizers to the existing recognizers in presidio, by calling `registry.load_predefined_recognizers()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "7abaea43",
   "metadata": {},
   "outputs": [],
   "source": [
    "registry = RecognizerRegistry()\n",
    "\n",
    "# Load existing recognizer\n",
    "registry.load_predefined_recognizers()\n",
    "\n",
    "# Add our custom one\n",
    "registry.add_recognizer(deny_list_recognizer)\n",
    "\n",
    "# Initialize AnalyzerEngine\n",
    "analyzer = AnalyzerEngine(registry=registry)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "5800fcc6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[type: PRESIDENT_FIRST_NAME, start: 0, end: 6, score: 1.0,\n",
       " type: PERSON, start: 0, end: 17, score: 0.85,\n",
       " type: LOCATION, start: 45, end: 62, score: 0.85]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "analyzer.analyze(\"George Washington was the first president of the United States\", language=\"en\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe2d811e",
   "metadata": {},
   "source": [
    "Since George is also a name, it was detected twice, once as a PERSON entity, and once as a custom entity."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8ad4649c",
   "metadata": {},
   "source": [
    "Read more:\n",
    "\n",
    "- For more info on Presidio Analyzer, see [this documentation](https://microsoft.github.io/presidio/analyzer/)\n",
    "- For more info on Presidio Anonymize, see [this documentation](https://microsoft.github.io/presidio/anonymizer/)\n",
    "- To further customize the anonymization type, see [this tutorial](https://microsoft.github.io/presidio/tutorial/11_custom_anonymization/)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "51da9e15",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "presidio",
   "language": "python",
   "name": "presidio"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
